
Stack Trace-Based Crash Deduplication with Transformer Adaptation

Mamun, Md Afif Al, Uddin, Gias, Xia, Lan, Zhang, Longyu

arXiv.org Artificial Intelligence

Automated crash reporting systems generate large volumes of duplicate reports, overwhelming issue-tracking systems (ITS) and increasing developer workload. Traditional stack trace-based deduplication methods, which rely on string similarity, rule-based heuristics, or deep learning (DL) models, often fail to capture the contextual and structural relationships within stack traces. We propose dedupT, a transformer-based approach that models stack traces holistically rather than as isolated frames. Extensive experiments on real-world datasets show that dedupT outperforms existing DL and traditional methods (e.g., sequence alignment and information retrieval techniques) in both duplicate ranking and unique crash detection, significantly reducing manual triage effort. On four public datasets, dedupT improves Mean Reciprocal Rank (MRR) often by over 15% compared to the best DL baseline and by up to 9% over traditional methods, while achieving higher Receiver Operating Characteristic Area Under the Curve (ROC-AUC) in detecting unique crash reports. Our work advances the integration of modern natural language processing (NLP) techniques into software engineering, providing an effective solution for stack trace-based crash deduplication. Software issues are generally reported through (1) human-submitted reports and (2) automated crash reports. Human-reported issues typically include textual descriptions detailing the issue, expected and observed behavior, and may include attachments such as images or videos. In contrast, automated crash reports are generated by crash reporting tools (e.g., Sentry). However, these automated systems often overwhelm ITS platforms by generating numerous duplicate crash reports for the same issue, requiring developers to review and triage them manually, which is a time-consuming process.
For instance, Mozilla Firefox received 2.2 million issues in the first week of 2016, the majority being duplicates [1], while 72% of crash reports in the IntelliJ Platform were found to be duplicates [2]. In such scenarios, grouping similar crashes together is essential, a process known as crash deduplication. Unlike human-written reports with detailed descriptions, automated crash reports primarily contain technical data like stack traces and crash dumps. Figure 1: Examples of Java and C++ stack traces.
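The duplicate-ranking setup described above can be illustrated with a toy sketch: a simple bag-of-frames vector stands in for dedupT's transformer encoder, and an incoming trace is ranked against known crashes by cosine similarity. All names (`embed`, `rank_duplicates`) are invented for illustration and are not taken from the paper:

```python
import math
from collections import Counter

def embed(stack_trace):
    # Toy stand-in for a transformer encoder: a bag-of-frames vector.
    # dedupT would instead encode the whole trace contextually.
    return Counter(stack_trace)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_duplicates(query_trace, known_traces):
    """Rank known crashes by similarity to the incoming one (the MRR setup)."""
    q = embed(query_trace)
    scored = [(cosine(q, embed(t)), i) for i, t in enumerate(known_traces)]
    return [i for _, i in sorted(scored, reverse=True)]

known = [
    ["java.lang.NullPointerException", "Foo.bar", "Main.main"],
    ["java.io.IOException", "Net.read", "Client.run"],
]
query = ["java.lang.NullPointerException", "Foo.bar", "Foo.baz", "Main.main"]
print(rank_duplicates(query, known))  # the NullPointerException crash ranks first
```

MRR then scores how high the true duplicate appears in this ranked list.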


Hear Your Code Fail, Voice-Assisted Debugging for Python

Amiri, Sayed Mahbub Hasan, Islam, Md. Mainul, Hossen, Mohammad Shakhawat, Amiri, Sayed Majhab Hasan, Mamun, Mohammad Shawkat Ali, Kabir, Sk. Humaun, Akter, Naznin

arXiv.org Artificial Intelligence

This staggering performance drain translates to roughly $61 billion in yearly financial losses throughout the worldwide software market, as quantified by the Standish Group's 2023 analysis of development workflows. The core inefficiency stems from traditional debugging's visual-only paradigm, where developers must manually parse dense, technical stack traces while mentally reconstructing error context, a process requiring intense cognitive focus that fragments attention between code logic and exception diagnostics. Neuroergonomic research from MIT's Human-Computer Interaction Lab reveals this context-switching triggers measurable cognitive overload, increasing prefrontal cortex activation by 60% compared to focused coding tasks, ultimately leading to mental fatigue that compounds debugging errors. The accessibility limitations of conventional debugging tools create additional barriers for approximately 12.5% of professional developers with visual impairments (World Health Organization, 2024), who struggle with screen readers that poorly interpret technical tracebacks. As Sarah Parker, a blind Python developer at Microsoft, testified during the 2023 Accessible Tech Symposium: "NVDA reads exception blocks as disconnected fragments. I spend more time reassembling error narratives than solving actual problems."
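The core idea, turning a dense traceback into a single listenable sentence, can be sketched in a few lines. The `narrate_exception` helper below is hypothetical (a real tool would hand the string to a TTS engine), not code from the paper:

```python
import traceback

def narrate_exception(exc):
    """Summarize a stack trace as one spoken-style sentence.
    A real voice-assisted debugger would pass this to a TTS engine."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]  # innermost frame: where the error actually occurred
    return (f"{type(exc).__name__} in function {last.name}, "
            f"line {last.lineno}: {exc}")

def buggy():
    return 1 / 0

try:
    buggy()
except ZeroDivisionError as e:
    msg = narrate_exception(e)

print(msg)
```

The point is the linearization: one coherent sentence instead of the fragmented blocks a screen reader produces from a raw traceback.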


A Tool for Generating Exceptional Behavior Tests With Large Language Models

Zhong, Linghan, Yuan, Samuel, Zhang, Jiyang, Liu, Yu, Nie, Pengyu, Li, Junyi Jessy, Gligoric, Milos

arXiv.org Artificial Intelligence

Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing "happy paths", i.e., paths without unwanted events, over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their projects (available at https://youtu.be/Jro8kMgplZk).


Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces

Jambigi, Neetha, Bogacz, Bartosz, Mueller, Moritz, Bach, Thomas, Felderer, Michael

arXiv.org Artificial Intelligence

Abrupt and unexpected terminations of software are termed software crashes, and they can be challenging to analyze. Finding the root cause requires extensive manual effort and expertise to connect information sources like stack traces, source code, and logs. Typical approaches to fault localization require either test failures or source code. Crashes occurring in production environments, such as that of SAP HANA, provide solely crash logs and stack traces. We present a novel approach to localize faults based only on stack trace information and no additional runtime information, by fine-tuning large language models (LLMs). We address complex cases where the root cause of a crash differs from the technical cause and is not located in the innermost frame of the stack trace. As the number of historic crashes is insufficient to fine-tune LLMs, we augment our dataset by leveraging code mutators to inject synthetic crashes into the code base. By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the HANA code base, we can correctly predict the root-cause location of a crash with an accuracy of 66.9%, while baselines only achieve 12.6% and 10.6%. We substantiate the generalizability of our approach by evaluating on two additional open-source databases, SQLite and DuckDB, achieving accuracies of 63% and 74%, respectively. Across all our experiments, fine-tuning consistently outperformed prompting non-finetuned LLMs for localizing faults in our datasets.
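The mutation-based data generation loop can be sketched minimally: inject a synthetic fault, trigger the crash, and record the stack trace together with the known root-cause location as a training pair. The toy `SOURCE` snippet and helper names are invented for illustration, not taken from the paper's tooling:

```python
import traceback

# Hypothetical toy "code base": a two-line function we can mutate.
SOURCE = [
    "def get(items, i):",
    "    return items[i]",
]

def mutate_and_crash():
    """Inject an off-by-one mutation, run it, and capture the resulting
    crash paired with its known root-cause line (the fine-tuning label)."""
    mutated = SOURCE[0] + "\n" + "    return items[i + 1]  # injected mutation\n"
    ns = {}
    exec(mutated, ns)
    try:
        ns["get"]([1, 2, 3], 2)  # last valid index; the mutation pushes it out of range
    except IndexError:
        return {
            "stack_trace": traceback.format_exc(),
            "root_cause_line": 2,  # we mutated line 2, so the label is known
        }

sample = mutate_and_crash()
print(sample["root_cause_line"])  # 2
```

Because the mutation site is known by construction, every synthetic crash comes with a ground-truth root-cause label, which is what makes fine-tuning at this scale possible.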


The Impact of Input Order Bias on Large Language Models for Software Fault Localization

Rafi, Md Nakhla, Kim, Dong Jae, Chen, Tse-Hsun, Wang, Shaowei

arXiv.org Artificial Intelligence

Large Language Models (LLMs) show great promise in software engineering tasks like Fault Localization (FL) and Automatic Program Repair (APR). This study examines how input order and context size affect LLM performance in FL, a key step for many downstream software engineering tasks. We test different orderings of methods using Kendall Tau distances, including "perfect" (where ground truths come first) and "worst" (where ground truths come last). Our results show a strong order bias, with Top-1 accuracy falling from 57% to 20% when we reverse the code order. Breaking down inputs into smaller contexts helps reduce this bias, narrowing the performance gap between perfect and worst orders from 22% to just 1%. We also look at ordering methods based on traditional FL techniques and metrics. Ordering using DepGraph's ranking achieves 48% Top-1 accuracy, better than simpler ordering approaches like CallGraph. These findings underscore the importance of how we structure inputs, manage contexts, and choose ordering methods to improve LLM performance in FL and other software engineering tasks.
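The order metric used here, Kendall Tau distance, counts how many pairs of methods appear in opposite relative order in two rankings; "perfect" and "worst" orders sit at the two extremes. A small sketch (method names are placeholders, not from the study):

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count pairs of items ranked in opposite relative order in the two lists."""
    pos_b = {m: i for i, m in enumerate(order_b)}
    # For each pair (x, y) with x before y in order_a, check if b flips it.
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])

perfect = ["buggy_m1", "buggy_m2", "m3", "m4"]  # ground truths first
worst = list(reversed(perfect))                 # ground truths last

print(kendall_tau_distance(perfect, perfect))  # 0
print(kendall_tau_distance(perfect, worst))    # 6, i.e., all C(4,2) pairs flipped
```

Intermediate orderings (e.g., one produced by DepGraph or CallGraph) fall somewhere between these two distances.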


exLong: Generating Exceptional Behavior Tests with Large Language Models

Zhang, Jiyang, Liu, Yu, Nie, Pengyu, Li, Junyi Jessy, Gligoric, Milos

arXiv.org Artificial Intelligence

Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their effort into "happy paths", e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction fine-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.
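What an EBT looks like next to the happy-path test developers usually write can be shown with a small self-contained sketch. exLong targets Java; this Python/unittest version with invented names (`withdraw`, `TestWithdraw`) is only illustrative:

```python
import unittest

def withdraw(balance, amount):
    """Code under test: throws on unwanted events (illegal argument, overdraft)."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

class TestWithdraw(unittest.TestCase):
    # Happy-path test: the kind developers usually write.
    def test_withdraw_ok(self):
        self.assertEqual(withdraw(100, 30), 70)

    # Exceptional behavior tests (EBTs): the kind exLong generates,
    # exercising the guarded throw statements directly.
    def test_withdraw_negative_amount_raises(self):
        with self.assertRaises(ValueError):
            withdraw(100, -5)

    def test_withdraw_overdraft_raises(self):
        with self.assertRaises(ValueError):
            withdraw(10, 50)

loader = unittest.TestLoader()
runner = unittest.TextTestRunner(verbosity=0)
result = runner.run(loader.loadTestsFromTestCase(TestWithdraw))
print(result.wasSuccessful())  # True
```

The two EBTs correspond to the two conditional expressions guarding the `raise` statements, exactly the structure exLong reasons about.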


Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios

Shibaev, Egor, Sushentsev, Denis, Golubev, Yaroslav, Khvorov, Aleksandr

arXiv.org Artificial Intelligence

In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., series of function calls that led to the error. Since there can be tens or hundreds of thousands of them describing the same issue from different users, automatic deduplication into categories is necessary to allow for processing. Recent works have proposed powerful deep learning-based approaches for this, but they are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To overcome this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts: (1) an embedding model with byte-pair encoding and approximate nearest neighbor search to quickly find the stack traces most relevant to the incoming one, and (2) a reranker that re-ranks the most fitting stack traces, taking into account the repeated frames between them. To complement the existing datasets collected from open-source projects, we share with the community SlowOps, a dataset of stack traces from IntelliJ-based products developed by JetBrains, which has an order of magnitude more stack traces per category. Finally, we carry out an evaluation that strives to be realistic: measuring not only the accuracy of categorization, but also the operation time and the ability to create new categories. The evaluation shows that our model strikes a good balance: it outperforms other models on both the open-source datasets and SlowOps, while also being faster than most. We release all of our code and data, and hope that our work can pave the way to further practice-oriented research in the area.
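The two-stage retrieve-then-rerank pipeline can be sketched with toy stand-ins: frame-set overlap replaces the BPE-embedding ANN search of stage 1, and a shared-frame-prefix score plays the role of the stage-2 reranker. All names here are invented for illustration, not taken from the paper's model:

```python
def retrieve(query, corpus, k=2):
    """Stage 1 stand-in: a cheap similarity in place of BPE embeddings + ANN,
    returning the indices of the k most promising candidates."""
    def overlap(trace):
        return len(set(query) & set(trace))
    return sorted(range(len(corpus)), key=lambda i: overlap(corpus[i]),
                  reverse=True)[:k]

def rerank(query, corpus, candidates):
    """Stage 2 stand-in: reorder candidates by frames repeated at the top
    of both traces, a crude proxy for the paper's repeated-frame reranker."""
    def shared_prefix(trace):
        n = 0
        for a, b in zip(query, trace):
            if a != b:
                break
            n += 1
        return n
    return sorted(candidates, key=lambda i: shared_prefix(corpus[i]),
                  reverse=True)

corpus = [
    ["parse", "read", "main"],
    ["parse", "validate", "main"],
    ["render", "draw", "main"],
]
query = ["parse", "validate", "check", "main"]
cands = retrieve(query, corpus)
print(rerank(query, corpus, cands))
```

The split matters operationally: the cheap stage-1 search keeps latency low over huge corpora, while the more careful reranker only touches a handful of candidates.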


Labeling questions inside issue trackers

Rasti, Aidin

arXiv.org Artificial Intelligence

One of the issues faced by the maintainers of popular open source software is the triage of newly reported issues. Many of the issues submitted to issue trackers are questions: people ask about their problems there instead of on a dedicated Q&A website like StackOverflow. This may seem insignificant, but for big projects with thousands of users it amounts to spam in the issue tracker. Reading and labeling these unrelated issues manually is a seriously time-consuming task, and these unrelated questions add to the maintainers' burden. In fact, maintainers most often demand that questions not be submitted to the issue tracker. To address this problem, first, we leveraged dozens of patterns to clean the text of issues, removing noise such as logs, stack traces, environment variables, and error messages. Second, we implemented a classification-based approach to automatically label unrelated questions. Empirical evaluations on a dataset of more than 102,000 records show that our approach can label questions with an accuracy of over 81%.
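The two steps, pattern-based cleaning followed by classification, might look like this toy sketch, where three regexes stand in for the paper's dozens of patterns and a few keyword cues stand in for the trained classifier (all invented for illustration):

```python
import re

# A few illustrative cleaning patterns (the paper uses dozens).
NOISE_PATTERNS = [
    re.compile(r"^\s*at\s+[\w.$]+\(.*\)\s*$", re.M),  # Java stack-trace frames
    re.compile(r"^[A-Z_]+=\S+$", re.M),               # environment variables
    re.compile(r"^\d{4}-\d{2}-\d{2}.*$", re.M),       # timestamped log lines
]

def clean(issue_text):
    """Strip logs, stack traces, and env vars before classification."""
    for pat in NOISE_PATTERNS:
        issue_text = pat.sub("", issue_text)
    return issue_text.strip()

# Toy stand-in for the trained classifier: keyword cues for "question" issues.
QUESTION_CUES = ("how do i", "how to", "why does", "is it possible", "?")

def is_question(issue_text):
    text = clean(issue_text).lower()
    return any(cue in text for cue in QUESTION_CUES)

issue = """How do I configure the cache size?
2024-01-05 12:00:01 ERROR starting up
at com.example.Cache.init(Cache.java:42)
JAVA_HOME=/usr/lib/jvm"""
print(is_question(issue))  # True
```

Cleaning first matters: noise like stack traces would otherwise dominate the features the real classifier learns from.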


ChatDBG: An AI-Powered Debugging Assistant

Levin, Kyla, van Kempen, Nicolas, Berger, Emery D., Freund, Stephen N.

arXiv.org Artificial Intelligence

This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
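The stack navigation ChatDBG automates can be sketched with Python's standard `traceback` module: collecting the frames an assistant would walk (the equivalent of `up`/`down` in a debugger) when answering a question like "why is x None?". The helper name is invented; this is not ChatDBG's actual implementation:

```python
import traceback

def frames_for_assistant(exc):
    """Collect the stack frames of a crash in a form an LLM assistant
    could inspect frame by frame."""
    return [
        {"function": f.name, "file": f.filename, "line": f.lineno, "code": f.line}
        for f in traceback.extract_tb(exc.__traceback__)
    ]

def outer():
    inner(None)

def inner(x):
    return x.value  # x is None here: the "why is x None?" scenario

try:
    outer()
except AttributeError as e:
    frames = frames_for_assistant(e)

print([f["function"] for f in frames])
```

A real assistant would iterate over such frames, issue further debugger commands to inspect variables in each one, and then report its root-cause findings back to the programmer.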


Combining Machine Learning and Lifetime-Based Resource Management for Memory Allocation and Beyond

Communications of the ACM

Memory management is a decades-old research area [24] that is fundamental to the performance of all applications. On modern architectures, memory managers determine a workload's ability to use 2MB (and 1GB) huge pages instead of traditional 4KB pages. The use of huge pages is crucial for performance on modern servers since they substantially reduce the cost of address translation by providing a wider reach in Translation Lookaside Buffers (TLBs), reducing misses on the CPU's critical path [5]. Current huge page-aware memory managers [13] trade off huge page usage against memory utilization, breaking up huge pages when they become inefficient. Figure 1 visualizes the source of this trade-off: when a C program allocates memory, it calls into a memory allocator library (e.g., TCMalloc [13]), which places the object at a particular address in memory until the program deletes it. The object may not move.
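The TLB-reach arithmetic behind the huge-page argument is easy to sketch: reach is simply the number of TLB entries times the page size, so switching from 4KB to 2MB pages multiplies reach by 512. Entry counts vary by CPU; 1536 second-level entries is only an illustrative figure:

```python
def tlb_reach(entries, page_bytes):
    """Memory addressable without a TLB miss: entries x page size."""
    return entries * page_bytes

KB, MB, GB = 1024, 1024**2, 1024**3

# Illustrative TLB size; real values differ per microarchitecture.
small = tlb_reach(1536, 4 * KB)  # traditional 4KB pages
huge = tlb_reach(1536, 2 * MB)   # 2MB huge pages

print(small // MB, huge // GB)   # 6 (MB) vs 3 (GB) of reach
```

That 512x jump in reach is why huge page-aware allocators accept some memory-utilization cost to keep huge pages intact.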